Building and Annotating the Linguistically Diverse NTU-MC (NTU-Multilingual Corpus)
Abstract
The NTU-MC compilation taps into the linguistic diversity of multilingual texts available within Singapore. The current version of NTU-MC contains 595,000 words (26,000 sentences) in 7 languages (Arabic, Chinese, English, Indonesian, Japanese, Korean and Vietnamese) from 7 language families (Afro-Asiatic, Sino-Tibetan, Indo-European, Austronesian, Japonic, Korean as a language isolate and Austro-Asiatic). The NTU-MC is annotated with a layer of monolingual annotation (POS and sense tags) and a layer of cross-lingual annotation (sentence-level alignments). The diverse language data and cross-lingual annotations provide valuable information on linguistic diversity for traditional linguistic research as well as natural language processing tasks. This paper describes the corpus compilation process and evaluates the monolingual and cross-lingual annotations of the corpus data. The corpus is available under the Creative Commons Attribution 3.0 Unported license (CC BY).
Similar resources
Tan Liling and Francis Bond. Building and Annotating the Linguistically Diverse NTU-MC (NTU-Multilingual Corpus)
The NTU-MC compilation taps into the linguistic diversity of multilingual texts available within Singapore. The current version of NTU-MC contains 375,000 words (15,000 sentences) in 6 languages (English, Chinese, Japanese, Korean, Indonesian and Vietnamese) from 6 language families (Indo-European, Sino-Tibetan, Japonic, Korean as a language isolate, Austronesian and Austro-Asiatic). The NTU-MC i...
NTU-MC Toolkit: Annotating a Linguistically Diverse Corpus
The NTU-MC Toolkit is a compilation of tools to annotate the Nanyang Technological University Multilingual Corpus (NTU-MC). The NTU-MC is a parallel corpus of linguistically diverse languages (Arabic, English, Indonesian, Japanese, Korean, Mandarin Chinese, Thai and Vietnamese). The NTU-MC thrives on the mantra of "more data is better data and more annotation is better information". Other than...
Developing Parallel Sense-tagged Corpora with Wordnets
Semantically annotated corpora play an important role in natural language processing. This paper presents the results of a pilot study on building a sense-tagged parallel corpus, part of ongoing construction of aligned corpora for four languages (English, Chinese, Japanese, and Indonesian) in four domains (story, essay, news, and tourism) from the NTU-Multilingual Corpus. Each subcorpus is firs...
The MC-value for monotonic NTU-games
The MC-value is introduced as a new single-valued solution concept for monotonic NTU-games. The MC-value is based on marginal vectors, which are extensions of the well-known marginal vectors for TU-games and hyperplane games. As a result of the definition it follows that the MC-value coincides with the Shapley value for TU-games and with the consistent Shapley value for hyperplane games. It is ...
From Historic Books to Annotated XML: Building a Large Multilingual Diachronic Corpus
This paper introduces our approach towards annotating a large heritage corpus, which spans over 100 years of alpine literature. The corpus consists of over 16,000 articles from the yearbooks of the Swiss Alpine Club, 60% of which represent German texts, 38% French, 1% Italian and the remaining 1% Swiss German and Romansh. The present work describes the inherent difficulties in processing a mult...